Add Random Forest Step #46

Draft · wants to merge 28 commits into base: dev
Conversation


@Gitiauxx Gitiauxx commented Sep 24, 2018

This PR adds a template for RandomForestStep.

The template automatically creates a random forest model using the sklearn library, with fit and run methods. The code can be found in regression.RandomForestStep.

For an example with real data, see machine_learning_steps_rental_prices.

The model can be saved in the configs directory as a yaml file and a pickled file.

Usage
A random forest is created as follows:
m = RandomForestRegressionStep()

RandomForestRegressionStep takes the following inputs (a combined sketch follows this list):

  • a model expression: a formula with lhs and rhs separated by '~'

m.model_expression = '... ~ ...'

  • a table registered in orca

m.tables = '...'

  • a model name

m.name = '....'
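
Putting these together, a minimal sketch (the table and column names are hypothetical placeholders, and the import path is assumed from the regression module mentioned above):

from urbansim_templates.regression import RandomForestRegressionStep

m = RandomForestRegressionStep()
m.model_expression = 'rent ~ sqft + bedrooms'  # lhs ~ rhs
m.tables = 'rentals'  # a table registered in orca
m.name = 'rf-rental-prices'
m.fit()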

The main difference from previous steps is that the model, once built, cannot be saved entirely in a yaml file and has to be pickled:

  • To save a random forest model, use the register method from modelmanager.py:
    modelmanager.register(m)
    which will add to orca a step with name m.name and pickle the model into a file saved in the configs directory.

  • To load a model that has been saved in the configs directory, use the initialize method from modelmanager.py:

modelmanager.initialize()

which will unpickle the model and turn it back into a RandomForestStep object (a combined round trip is sketched below).
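
Together, a save/load round trip looks like this (import path assumed from the repo layout):

from urbansim_templates import modelmanager

modelmanager.register(m)   # adds an orca step named m.name; pickles the model to configs
# ... later, possibly in a fresh session ...
modelmanager.initialize()  # unpickles everything in configs back into step objects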

Fitting and cross-validation
The random forest step includes some validation metrics:

  • feature importance, which helps identify the main explanatory variables, via the command

m.importance

  • cross-validation: using a k-fold technique, hold out one fold, train the model on the remaining k-1 folds, and test on the held-out fold (a sketch of the procedure follows).

m.cross_validate_scorer(nsplits=k)

returns the mean of the errors across all folds.
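
For reference, here is a minimal sketch of the k-fold procedure described above, written directly against sklearn (the actual internals of cross_validate_scorer may differ):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def cross_validate_sketch(X, y, nsplits=5):
    # hold out one fold at a time, train on the other k-1 folds,
    # and average the test errors across folds
    errors = []
    for train_idx, test_idx in KFold(n_splits=nsplits).split(X):
        model = RandomForestRegressor()
        model.fit(X[train_idx], y[train_idx])
        errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
    return np.mean(errors)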

Additional libraries
RandomForestStep is built on top of the sklearn machine learning library; version 0.19.2 or more recent is required. Moreover, to efficiently save and load complicated random forest models, modelmanager provides a pickle/unpickle utility based on dill (version 0.2.8.2 or more recent).
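
For illustration, a dill-based save/load utility of this kind boils down to something like the following (function names here are made up, not the actual modelmanager API):

import dill

def save_pickled(obj, path):
    # dill can serialize objects that the standard pickle module struggles with,
    # such as fitted estimators wrapped inside template classes
    with open(path, 'wb') as f:
        dill.dump(obj, f)

def load_pickled(path):
    with open(path, 'rb') as f:
        return dill.load(f)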

@Gitiauxx Gitiauxx self-assigned this Sep 24, 2018
@Gitiauxx Gitiauxx requested a review from smmaurer September 26, 2018 22:06
@Gitiauxx Gitiauxx added the enhancement New feature or request label Sep 26, 2018
@smmaurer smmaurer (Member) commented Oct 15, 2018

Hi @Gitiauxx,

Apologies, I'm just getting a chance to review this. Generally looks great; here are some initial comments:

  1. Looks like your text editor is mixing tabs and spaces, which makes some of the code hard to read on GitHub -- standard Python style is spaces only, which you can probably switch to in a setting. (PEP 8)

I believe it is fixed.

  2. If it's not too much trouble, let's move the neural network template out of this PR and into a subsequent one, just to keep things simpler.

Done.

  3. For _get_input_columns(): generally, we're trying to support Patsy-style model expressions, and use Patsy to parse them. Maybe there's a Patsy function to get the names of input columns? An advantage is it would automatically support inline transformations.

Yes and no:
The initial issue is that sklearn-based models (and probably other estimation libraries) do not rely on patsy formulas as inputs. This is an important difference from statsmodels-based models.

However, I do not need _get_input_columns() anymore because I created a utility that takes a patsy formula and generates the corresponding inputs for sklearn models (see from_patsy_to_array in utils.py). It also creates a list of variable names.
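
The core of such a utility can be built on patsy.dmatrices; a sketch of the idea (not the exact implementation of from_patsy_to_array):

import numpy as np
import patsy

def from_patsy_to_array_sketch(model_expression, df):
    # split a 'lhs ~ rhs' formula into the y vector and X matrix that
    # sklearn estimators expect, plus the rhs column names
    y, X = patsy.dmatrices(model_expression, data=df)
    return np.ravel(y), np.asarray(X), X.design_info.column_names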

Related to the issue of generalizing away from statsmodels-based steps, see #43 for an explanation of why we need a utility like convert_to_model to convert sklearn models' fit and predict methods into methods that mimic what statsmodels' fit and predict methods do.
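
Conceptually, such a wrapper gives an sklearn estimator a statsmodels-flavored interface; a hypothetical sketch of the idea (not the actual convert_to_model code):

import numpy as np
import patsy

class StatsmodelsStyleWrapper(object):
    # hypothetical: wrap an sklearn estimator so fit() and predict()
    # behave the way statsmodels-based template steps expect
    def __init__(self, estimator, model_expression):
        self.estimator = estimator
        self.model_expression = model_expression

    def fit(self, df):
        y, X = patsy.dmatrices(self.model_expression, data=df)
        self.estimator.fit(np.asarray(X), np.ravel(y))
        return self  # statsmodels' fit() returns a results-like object

    def predict(self, df):
        rhs = self.model_expression.split('~')[1]  # rhs-only design matrix
        X = patsy.dmatrix(rhs, data=df)
        return self.estimator.predict(np.asarray(X))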

  4. Can you add docstrings for all of your utilities, with detailed descriptions of what the functions do, the parameters they take, and their return values?

Absolutely. I believe I went over all of them. Let me know if the format works for you.

  5. Looks like there are duplicate modelmanager imports now in binary_logit.py, large_multinomial_logit.py, and small_multinomial_logit.py, probably from merging master into this branch or something.

Fixed. Probably my mistake when I merged master with another branch.

  6. You can update the library version number to 0.1.dev16 in setup.py and urbansim_templates/__init__.py. (I figure we'll merge PR [0.1.dev15] Updating Large MNL template with better simulation functionality #51 before this one.)

Done.

  7. Could you make a note in the PR writeup about the additional libraries (and their version numbers) that these templates require? We should add them to setup.py and also to our conda environments for working with the templates (e.g. here).

Done.

  8. I haven't tried running your tests, but are you using pytest? It's a useful utility that makes it easier to run a whole suite of tests at once, either locally or on a continuous integration server (which we'll set up soon for this library). You can run it from the command line with pytest *.py. See my recent updates to test_large_multinomial_logit.py for an example of some of the features. You can use fixtures for data setup instead of __main__ (a quick example follows).
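
For example, a fixture-based test of the kind suggested here could look like this (all names and data made up):

import pandas as pd
import pytest

@pytest.fixture
def rentals_data():
    # fake data set up once per test that requests it, instead of in __main__
    return pd.DataFrame({'rent': [1500., 2200., 1800., 2600.],
                         'sqft': [600, 950, 750, 1100],
                         'bedrooms': [1, 2, 1, 3]})

def test_data_shape(rentals_data):
    assert len(rentals_data) == 4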

@smmaurer smmaurer (Member) commented

Awesome, thank you for all the fixes! This looks great.

I left some related comments in issues #43 and #50 too.

The functionality of the templates looks good, and the cross-validation is going to be really helpful.

A couple more issues with the tests:

The cross-validation and gradient boosting tests use data files that aren't committed to the repo. (I've just been using fake data for tests so far, but real data seems great.) I forget, are the files small enough that we can just add them to the repo? Under 1 MB would be great. If not, would the tests still work with smaller excerpts of the data?

The file paths in those tests are also Windows-specific; you can use os.path.join() to make them cross-platform.
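
For example (file names illustrative):

import os

data_path = os.path.join('data', 'rentals.csv')  # works on Windows, macOS, and Linux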

And you can move the neural network tests to the branch with that model.

Once the tests run on my machine, let's merge this!

@smmaurer smmaurer (Member) commented

Just took a look at the merge conflicts (easy to fix in the web interface):

  • setup.py: we should keep your version line
  • test_large_multinomial_logit.py: keep the length assertion line from master, otherwise keep additions from both branches
  • utils.py: keep additions from both branches

@Gitiauxx Gitiauxx (Author) commented

Thanks @smmaurer

I will go over and develop a comprehensive battery of tests for this PR using pytest.

@Gitiauxx Gitiauxx (Author) commented

@smmaurer

I started running a battery of tests (see the most recent commits) for the code I wrote. So far they have passed on my machine (Windows). You may want to try them on a Mac.

@smmaurer smmaurer (Member) commented

@Gitiauxx Looking good, most of them are passing for me too.

Cross validation is failing with AttributeError: 'RandomForestRegressionStep' object has no attribute '_get_input_columns'

Looking at the yaml for Random Forest, I noticed a couple of things:

  • let's use consistent naming for the keys, so cross validation metric -> cross_validation_metric, features importance -> feature_importance
  • it's saving the model step name twice, in the name key (autogenerated by modelmanager) and also in the model key (generated in the template definition); the latter might not be needed?

@Gitiauxx Gitiauxx (Author) commented

@smmaurer

No problem. I believe I fixed everything. cross_validate was failing because I forgot to push my last edit of shared.py.

@xyzjayne xyzjayne commented

Thanks for creating this! Will there be a RF classifier on the way?

@smmaurer smmaurer marked this pull request as draft February 16, 2021 23:49
@smmaurer smmaurer changed the base branch from master to main February 16, 2021 23:49
@smmaurer smmaurer changed the base branch from main to dev February 16, 2021 23:50